机器的图像编码(ICM)旨在压缩图像进行AI任务分析,而不是满足人类的看法。学习一种既是一般(用于AI任务)的特征,也是紧凑的(用于压缩)的功能,这对于其成功而言至关重要。在本文中,我们试图通过学习通用功能,同时考虑压缩来开发ICM框架。我们将诸如无所不能功能和相应框架的功能命名为Omni-ICM。考虑到自我监督学习(SSL)提高了特征的概括,我们将其与压缩任务集成到OMNI-ICM框架中,以学习无所不能的功能。但是,在SSL中协调语义建模并在压缩中删除冗余是不平凡的,因此我们通过合作实例区分和熵最小化以自适应掉落的信息来设计新颖的信息过滤(如果)模块,以较弱相关的信息执行AI任务(例如,某些纹理冗余)。与以前的特定解决方案不同,Omni-ICM可以直接基于学习的无能功能的AI任务分析,而无需联合培训或额外的转换。尽管简单而直观,但Omni-ICM在多个基本愿景任务上大大优于现有的传统和基于学习的编解码器。
translated by 谷歌翻译
在本文中,我们介绍了第一个神经视频编解码器,可以在用于低延迟模式的UVG数据集上的SRGB PSNR方面与最新编码标准H.266 / VVC竞争。现有的神经混合视频编码方法依赖于用于预测的光流或高斯尺度流,这不能支持对不同运动内容的细粒度适应性。为了更具内容 - 自适应预测,我们提出了一种新颖的跨尺度预测模块,实现更有效的运动补偿。具体地,一方面,我们生产参考特征金字塔作为预测源,然后传输利用特征尺度的横级流来控制预测的精度。另一方面,我们将加权预测的机制介绍到具有单个参考帧的预测场景的机制,其中发送交叉尺度权重映射以合成精细预测结果。除了串尺度预测模块之外,我们还提出了一种多级量化策略,这提高了在推理期间没有额外计算惩罚的速率失真性能。我们展示了我们有效的神经视频编解码器(ENVC)对几个常见的基准数据集的令人鼓舞的表现,并详细分析了每个重要组成部分的有效性。
translated by 谷歌翻译
学习的视频压缩方法在赶上其速率 - 失真(R-D)性能时,追赶传统视频编解码器的许多承诺。然而,现有的学习视频压缩方案受预测模式和固定网络框架的绑定限制。它们无法支持各种帧间预测模式,从而不适用于各种场景。在本文中,为了打破这种限制,我们提出了一种多功能学习的视频压缩(VLVC)框架,它使用一个模型来支持所有可能的预测模式。具体而言,为了实现多功能压缩,我们首先构建一个运动补偿模块,该模块应用用于在空间空间中的加权三线性翘曲的多个3D运动矢量字段(即,Voxel流量)。 Voxel流量传达了时间参考位置的信息,有助于与框架设计中的帧间预测模式分离。其次,在多参考帧预测的情况下,我们应用流预测模块以预测具有统一多项式函数的准确运动轨迹。我们表明流量预测模块可以大大降低体素流的传输成本。实验结果表明,我们提出的VLVC不仅支持各种设置中的多功能压缩,而且还通过MS-SSIM的最新VVC标准实现了可比的R-D性能。
translated by 谷歌翻译
Increasing research interests focus on sequential recommender systems, aiming to model dynamic sequence representation precisely. However, the most commonly used loss function in state-of-the-art sequential recommendation models has essential limitations. To name a few, Bayesian Personalized Ranking (BPR) loss suffers the vanishing gradient problem from numerous negative sampling and predictionbiases; Binary Cross-Entropy (BCE) loss subjects to negative sampling numbers, thereby it is likely to ignore valuable negative examples and reduce the training efficiency; Cross-Entropy (CE) loss only focuses on the last timestamp of the training sequence, which causes low utilization of sequence information and results in inferior user sequence representation. To avoid these limitations, in this paper, we propose to calculate Cumulative Cross-Entropy (CCE) loss over the sequence. CCE is simple and direct, which enjoys the virtues of painless deployment, no negative sampling, and effective and efficient training. We conduct extensive experiments on five benchmark datasets to demonstrate the effectiveness and efficiency of CCE. The results show that employing CCE loss on three state-of-the-art models GRU4Rec, SASRec, and S3-Rec can reach 125.63%, 69.90%, and 33.24% average improvement of full ranking NDCG@5, respectively. Using CCE, the performance curve of the models on the test data increases rapidly with the wall clock time, and is superior to that of other loss functions in almost the whole process of model training.
translated by 谷歌翻译
In the scenario of black-box adversarial attack, the target model's parameters are unknown, and the attacker aims to find a successful adversarial perturbation based on query feedback under a query budget. Due to the limited feedback information, existing query-based black-box attack methods often require many queries for attacking each benign example. To reduce query cost, we propose to utilize the feedback information across historical attacks, dubbed example-level adversarial transferability. Specifically, by treating the attack on each benign example as one task, we develop a meta-learning framework by training a meta-generator to produce perturbations conditioned on benign examples. When attacking a new benign example, the meta generator can be quickly fine-tuned based on the feedback information of the new task as well as a few historical attacks to produce effective perturbations. Moreover, since the meta-train procedure consumes many queries to learn a generalizable generator, we utilize model-level adversarial transferability to train the meta-generator on a white-box surrogate model, then transfer it to help the attack against the target model. The proposed framework with the two types of adversarial transferability can be naturally combined with any off-the-shelf query-based attack methods to boost their performance, which is verified by extensive experiments.
translated by 谷歌翻译
Supervised Deep-Learning (DL)-based reconstruction algorithms have shown state-of-the-art results for highly-undersampled dynamic Magnetic Resonance Imaging (MRI) reconstruction. However, the requirement of excessive high-quality ground-truth data hinders their applications due to the generalization problem. Recently, Implicit Neural Representation (INR) has appeared as a powerful DL-based tool for solving the inverse problem by characterizing the attributes of a signal as a continuous function of corresponding coordinates in an unsupervised manner. In this work, we proposed an INR-based method to improve dynamic MRI reconstruction from highly undersampled k-space data, which only takes spatiotemporal coordinates as inputs. Specifically, the proposed INR represents the dynamic MRI images as an implicit function and encodes them into neural networks. The weights of the network are learned from sparsely-acquired (k, t)-space data itself only, without external training datasets or prior images. Benefiting from the strong implicit continuity regularization of INR together with explicit regularization for low-rankness and sparsity, our proposed method outperforms the compared scan-specific methods at various acceleration factors. E.g., experiments on retrospective cardiac cine datasets show an improvement of 5.5 ~ 7.1 dB in PSNR for extremely high accelerations (up to 41.6-fold). The high-quality and inner continuity of the images provided by INR has great potential to further improve the spatiotemporal resolution of dynamic MRI, without the need of any training data.
translated by 谷歌翻译
Recent studies have shown that using an external Language Model (LM) benefits the end-to-end Automatic Speech Recognition (ASR). However, predicting tokens that appear less frequently in the training set is still quite challenging. The long-tail prediction problems have been widely studied in many applications, but only been addressed by a few studies for ASR and LMs. In this paper, we propose a new memory augmented lookup dictionary based Transformer architecture for LM. The newly introduced lookup dictionary incorporates rich contextual information in training set, which is vital to correctly predict long-tail tokens. With intensive experiments on Chinese and English data sets, our proposed method is proved to outperform the baseline Transformer LM by a great margin on both word/character error rate and tail tokens error rate. This is achieved without impact on the decoding efficiency. Overall, we demonstrate the effectiveness of our proposed method in boosting the ASR decoding performance, especially for long-tail tokens.
translated by 谷歌翻译
The objective of this paper is to learn dense 3D shape correspondence for topology-varying generic objects in an unsupervised manner. Conventional implicit functions estimate the occupancy of a 3D point given a shape latent code. Instead, our novel implicit function produces a probabilistic embedding to represent each 3D point in a part embedding space. Assuming the corresponding points are similar in the embedding space, we implement dense correspondence through an inverse function mapping from the part embedding vector to a corresponded 3D point. Both functions are jointly learned with several effective and uncertainty-aware loss functions to realize our assumption, together with the encoder generating the shape latent code. During inference, if a user selects an arbitrary point on the source shape, our algorithm can automatically generate a confidence score indicating whether there is a correspondence on the target shape, as well as the corresponding semantic point if there is one. Such a mechanism inherently benefits man-made objects with different part constitutions. The effectiveness of our approach is demonstrated through unsupervised 3D semantic correspondence and shape segmentation.
translated by 谷歌翻译
Patients take care of what their teeth will be like after the orthodontics. Orthodontists usually describe the expectation movement based on the original smile images, which is unconvincing. The growth of deep-learning generative models change this situation. It can visualize the outcome of orthodontic treatment and help patients foresee their future teeth and facial appearance. While previous studies mainly focus on 2D or 3D virtual treatment outcome (VTO) at a profile level, the problem of simulating treatment outcome at a frontal facial image is poorly explored. In this paper, we build an efficient and accurate system for simulating virtual teeth alignment effects in a frontal facial image. Our system takes a frontal face image of a patient with visible malpositioned teeth and the patient's 3D scanned teeth model as input, and progressively generates the visual results of the patient's teeth given the specific orthodontics planning steps from the doctor (i.e., the specification of translations and rotations of individual tooth). We design a multi-modal encoder-decoder based generative model to synthesize identity-preserving frontal facial images with aligned teeth. In addition, the original image color information is used to optimize the orthodontic outcomes, making the results more natural. We conduct extensive qualitative and clinical experiments and also a pilot study to validate our method.
translated by 谷歌翻译
In this paper, we study the \underline{R}obust \underline{o}ptimization for \underline{se}quence \underline{Net}worked \underline{s}ubmodular maximization (RoseNets) problem. We interweave the robust optimization with the sequence networked submodular maximization. The elements are connected by a directed acyclic graph and the objective function is not submodular on the elements but on the edges in the graph. Under such networked submodular scenario, the impact of removing an element from a sequence depends both on its position in the sequence and in the network. This makes the existing robust algorithms inapplicable. In this paper, we take the first step to study the RoseNets problem. We design a robust greedy algorithm, which is robust against the removal of an arbitrary subset of the selected elements. The approximation ratio of the algorithm depends both on the number of the removed elements and the network topology. We further conduct experiments on real applications of recommendation and link prediction. The experimental results demonstrate the effectiveness of the proposed algorithm.
translated by 谷歌翻译